early training dynamic
- North America > United States > Maryland > Prince George's County > College Park (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Oceania > Australia > New South Wales > Sydney (0.04)
- (2 more...)
Phase diagram of early training dynamics in deep neural networks: effect of the learning rate, depth, and width
We systematically analyze optimization dynamics in deep neural networks (DNNs) trained with stochastic gradient descent (SGD) and study the effect of learning rate $\eta$, depth $d$, and width $w$ of the neural network. By analyzing the maximum eigenvalue $\lambda^H_t$ of the Hessian of the loss, which is a measure of sharpness of the loss landscape, we find that the dynamics can show four distinct regimes: (i) an early time transient regime, (ii) an intermediate saturation regime, (iii) a progressive sharpening regime, and (iv) a late time edge of stability regime.
- North America > United States > Maryland > Prince George's County > College Park (0.04)
- Oceania > Australia > New South Wales > Sydney (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Phase diagram of early training dynamics in deep neural networks: effect of the learning rate, depth, and width
We systematically analyze optimization dynamics in deep neural networks (DNNs) trained with stochastic gradient descent (SGD) and study the effect of learning rate \eta, depth d, and width w of the neural network. By analyzing the maximum eigenvalue \lambda H_t of the Hessian of the loss, which is a measure of sharpness of the loss landscape, we find that the dynamics can show four distinct regimes: (i) an early time transient regime, (ii) an intermediate saturation regime, (iii) a progressive sharpening regime, and (iv) a late time "edge of stability" regime. We identify several critical values of c, which separate qualitatively distinct phenomena in the early time dynamics of training loss and sharpness. Notably, we discover the opening up of a "sharpness reduction" phase, where sharpness decreases at early times, as d and 1/w are increased.
Phase diagram of early training dynamics in deep neural networks: effect of the learning rate, depth, and width
Kalra, Dayal Singh, Barkeshli, Maissam
We systematically analyze optimization dynamics in deep neural networks (DNNs) trained with stochastic gradient descent (SGD) and study the effect of learning rate $\eta$, depth $d$, and width $w$ of the neural network. By analyzing the maximum eigenvalue $\lambda^H_t$ of the Hessian of the loss, which is a measure of sharpness of the loss landscape, we find that the dynamics can show four distinct regimes: (i) an early time transient regime, (ii) an intermediate saturation regime, (iii) a progressive sharpening regime, and (iv) a late time ``edge of stability" regime. The early and intermediate regimes (i) and (ii) exhibit a rich phase diagram depending on $\eta \equiv c / \lambda_0^H $, $d$, and $w$. We identify several critical values of $c$, which separate qualitatively distinct phenomena in the early time dynamics of training loss and sharpness. Notably, we discover the opening up of a ``sharpness reduction" phase, where sharpness decreases at early times, as $d$ and $1/w$ are increased.
- North America > United States > Maryland > Prince George's County > College Park (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Oceania > Australia > New South Wales > Sydney (0.04)
- (2 more...)